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Abstract 

In this article we focus on Maximum Likelihood estimation (MLE) for the static parameters of hidden Markov 
models (HMMs). We will consider the case where one cannot or does not want to compute the conditional 
likelihood density of the observation given the hidden state because of increased computational complexity or 
analytical intractability. Instead we will assume that one may obtain samples from this conditional likelihood and 
hence use approximate Bayesian computation (ABC) approximations of the original HMM. ABC approximations 
are biased, but the bias can be controlled to arbitrary precision via a parameter e > 0; the bias typically goes 
to zero as e \ 0. We first establish that the bias in the log-likelihood and gradient of the log-likelihood of the 
ABC approximation, for a fixed batch of data, is no worse than 0{ne), n being the number of data; hence, for 
computational reasons, one might expect reasonable parameter estimates using such an ABC approximation. 
Turning to the computational problem of estimating 9, we propose, using the ABC-sequential Monte Carlo 
(SMC) algorithm in [18], an approach based upon simultaneous perturbation stochastic approximation (SPSA). 
Our method is investigated on two numerical examples. 

Key- Words: Approximate Bayesian Computation, Hidden Markov Models, Parameter Estimation, Sequential 
Monte Carlo 



1 Introduction 

Hidden Markov models provide a flexible description of a wide variety of real-life phenomena; see [5] for an overview. 
An HMM is a pair of discrete-time stochastic processes, {-^n}„>o ^nd {i^n}„>i, where Xn G X C M''^ is an 
unobserved process and yn ^ ^ Q K'^" is observed. The hidden process {^n}„>o is a Markov chain with initial 
density ^q{xq) at time and transition density fg (x„|a;„_i), with ^ S O C W^o i.e. 

fe(Xo e A) = I ng{xo)dxo and Pe(X„ e A\Xn-i = Xn-i) = / /e(x„|x„_i)da;„ n>l (1) 

J A J A 

where Fg denotes probability, A g B(X) and dxn is Lebesgue measure. In addition, the observations {i^n}„>i 
conditioned upon {^n}„>o statistically independent and have marginal density gg {yn\xn), i-e. 

VeiYn € B\{Xk}k>o - {xk}k>Q) = / ge{yn\xn)dyn n > 1 (2) 

Jb 

with B e B(Y). The HMM is given by equations ([l])-([2]) and is often referred to in the literature as a state-space 
model. Here 6* is a static parameter, which is to be estimated in using MLE and online as the data arrive; this 
problem has a large range of real applications such as financial modelling or weather prediction. 

Statistical inference from the class of HMMs described above is typically non-trivial. In most scenarios of 
practical interest one cannot calculate the likelihood: 

Pe{yi:n) = j ge{yn\Xn)'^e{Xn\yi:n-l)dXn 

where yi-.n '■— (yi,...,?/„) and ■Ke{xn\yi:n~i) is the predictor; see e.g. [5] for the standard filtering recursions. 
Hence as the likelihood is not analytically tractable, one must resort to numerical methods, not only to compute 
it, but to maximize pg{yi:n) w.r.t. 9. When 9 is known, a popular collection of techniques for both estimating 
the likelihood as well as performing filtering and smoothing are sequential Monte Carlo methods e.g. [15]. SMC 
techniques simulate a collection of N samples in parallel, sequentially in time and combine importance sampling 
and resampling to approximate a sequence of probability distributions of increasing state-space known pointwise 



up-to a multiplicative constant. These techniques provide a natural estimate of the likelihood. The estimate is 
quite well understood and is known to be unbiased TO] and, in addition, the relative variance is known to increase 
linearly with n [TUSH], n being the number of data. When is unknown, as is the case here, estimation of is 
complicated by the path-degeneracy problem of SMC methods; e.g. [2^. However, there are still many specialized 
SMC techniques which can successfully be used for online parameter estimation of HMMs in a wide variety of 
contexts, such as [T31I15]. Most of these techniques require the evaluation of ge{y\x) and potentially the gradient 
vectors as well. 

In this article, we consider the scenario where geiynlxn) is either intractable, in the sense that one cannot 
calculate it or an unbiased estimator of it, or one does not want to calculate the density, potentially due to the 
high-dimensionality of Xn- It is assumed that one can sample from ge{yn\xn)dyn- In this case, one cannot use 
standard or the more advanced SMC methods that are mentioned above (or indeed many other techniques) and 
hence exact online parameter estimation is difficult to achieve. One approach which is designed to deal with this 
problem are ABC techniques; see e.g. [20]. Whilst there are a number of other competitors [T7], we focus upon ABC 
ideas; see [171 EHj for some discussion of the relative merits of ABC against competing methods. In the context of 
HMMs there has been some work on the construction of ABC approximations of HMMs [HIES], computational 
techniques for filtering and smoothing [151 HIl ISj and their statistical consistency for parameter estimation [51 H] . 
ABC approximations of HMMs are biased, but the bias can be controlled to arbitrary precision via a parameter 
e > 0; the bias typically goes to zero as e \ 0. At present there is not a methodology which can achieve our 
objective of online parameter estimation. In this article we do the following: 

1. Investigate the bias in the log- likelihood and the gradient of the log- likelihood that is induced by the ABC 
approximation for a fixed data set. 

2. Develop an SMC approach with cost 0{N) that allows one to estimate the static parameters in an online 
fashion. 

In order to estimate the parameters one must obtain numerical estimates of the log-likelihood and gradient of this 
quantity. It is then important to understand what happens to the bias of the ABC approximation of these latter 
quantities, as the time parameter (number of data, n) grows. We establish, under some assumptions, that this 
ABC bias, for both quantites is no worse than ©(ne); this result is associated to the theoretical work in [51 Hj. 
These former results indicate that the ABC approximation is amenable to numerical implementation: parameter 
estimation will not necessarily be dominated by the bias; we discuss why this is the case in Remarks |2.1| and |2.2| 
For 2. we introduce an SMC approach based upon SPSA |28| to estimate the parameters in an online manner (see 
also |27j in the context of HMMs) . This methodology can be expected to 'work well' when: 

• dj. is large and dg, dy are small to moderate. 

Whilst these statements are somewhat delicate (e.g. what is large), in the scenario of high-dimensional states, it 
has been established in [3] that the simulation error does not explode in the dimension. As a result, the ideas here 
can be seen as principled competitors (and related to - see [IJ) to ensemble kalman filter-based algorithms such as 
in [ig. 

This paper is structured as follows. In Section [2] we discuss the model and ABC approximation. Our bias result 
is also given. In Section [3] our computational strategy is outlined. In Section |4] the method is investigated from a 
computational perspective. In Section [5] the article is concluded with some discussion of future work. The proofs 
of our results can be found in the appendix. 



2 Model and Approximation 
2.1 Model and Estimation 

Consider first the joint filtering or smoothing density of the HMM given by 

f I s fJ-0{xo)I\k=i9e{yk\xk)fe{xk\xk-i) 

Tre[XO:n\yi:n) - 7 7 r^ffi ? 1 VTl 1 V; 

Jx„+i fJ-eixo) l[k=i 9e(yk\xk)Je(Xk\xk-i)dxo:n 
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where 9 ^ Q Q W^" is the static parameter, a;„ G X are the hidden states and ?/„ € Y the observations. This quantity 
can be computed recursively using 

7r0(xo:fc|yi:fc-i) = / TT0{xo.,k-i\yi:k-i)fe{.Xk\xk-i)dxk (3) 

7re(a;o:fe 2/i:fc) = 7 — , ^ (4) 

Ve{Vk\y\:k~\) 

with the recursive likelihood being 

P9iyk\yi:k-i) = / 9eiyk\xk)'^e{xo:k\yi:k-i)dxk (5) 
Jx 

Furthermore we write the log- (marginal) likelihood at time n: 

n 

log(w(2/i:„)) = ^log{p0{yk\yi:k-i))- 

k=l 

In the context of MLE one is usually interested computing 

9 = argmaxlog(p9(2/i.„)) 

6e& 

Note that this is a batch or off-line method, which means that one needs to wait first to collect the complete 
dataset and then compute the ML estimate. For a long observation sequence the computation of the gradient at 
each iteration of the algorithm can be prohibitive. Therefore, one uses on-line methods whereby the estimate of the 
parameter is updated sequentially as the data arrives. A practical alternative would be to consider the following 
update scheme at time k, for some sequence {afe}fc>i 

Ok+i = dk +ak+i Vlog(pe(2/fc|j/i:fe-i))le=e^ ■ 

Upon receiving yk, the parameter estimate is updated in the direction of ascent of the conditional density of this 
new observation. The algorithm in the present form is not suitable for on-line implementation due to the need 
to evaluate the gradient of '^ogpe{yk\yo:k-i) at the current parameter estimate which would require computing the 
filter from time to time k using the current parameter value 9k. 

A recursive ML (RML) algorithm bypassing this problem has been proposed in the literature when X is finite 
in [15]. It relies on the following update scheme 



7fe+l — 



9k + ak+iV\og{pe,yj^{yk\yi:k-~i)) 



where the positive non-increasing step-size sequence {aA:}fc>i satisfies ^}^ak = 00 and X]fc '^fe < ^ tt^i ®-§- 
a-k = fc"" for 0.5 < a < 1. The quantity \ogpe^,^(yk\yi:k-i) is defined as 

V log {p0^.^ {yk\yi:k-i)) = V log (peo:fc (yi:fc)) - V log (yi:fc-i)) 

where the notation V log (pg^.^. (j/i:^)) indicates that at each time k the quantities in ([3|-([5| are computed using the 
parameter estimate 9k. The asymptotic properties of this algorithm (i.e. the behavior of 9k in the limit as k goes 
to infinity) have been studied in [12] for a finite state-space HMM. It is shown that under regularity conditions 
this algorithm converges towards a local maximum of the average log-likelihood; this average log-likelihood being 
maximized at the 'true' parameter value. 

In this article, we would like to implement approximate versions of these on-line and off-line ML schemes when 
both the following cases hold: 

• Case 1: We can sample from the conditional distribution of Y\x, for any fixed 9 and x. 

• Case 2: We cannot or do not want to evaluate the conditional density of Y\x, ge{y\x) and do not have access 
to an unbiased estimate of it. 

Apart from using likelihoods which do not admit computable densities such as some stable distributions, this context 
might appear relevant to the context when one is interested to use SMC methods and evaluate gg{y\x) when dx is 
large. SMC methods for filtering do not always scale well with the dimension of the hidden state dx , often requiring 
a computational cost 0{k'^''), with k > 1; see e.g. [3l|4]. A more detailed discussion on the difficulties of using SMC 
methods in high dimensions is far beyond the scope of this article, but we remark the ideas in this paper can be 
relevant in this context. 
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2.2 ABC Approximation and Noisy ABC 

To facilitate statistical inference, we consider an ABC approximation of the joint smoothing density (e.g. [THl 1^ ): 

fJ^eixo) rife^i KgAyk\uk)g0{uk\xk)fe{xk\xk-i) 



7r(9,e(wi :ni ^0:n|yi;n) 



Jx^+ixY" f^e{xo)Y[l=i KeAyk\uk)ge{uk\xk)fe{xk\xk-i)dui;nXo:r. 



where u„ € Y are pseudo observations, e > Kg : Y x x Q ^ U {0} is some kernel function that has 
bandwidth that depends upon a precision parameter e > 0. Examples include: 

Ke,e{yk\uk) = I{u:|yfc-«|<e}("/c) 

KgAyk\uk) = (j)dy{yk;uk,eld^) 

where I is the indicator function, | ■ | is the Li— norm, ^, S) is normal density on d— dimensions with mean ^ 
and covariance E and Id is the d— dimensional identity matrix. 
Consider the quantity, to be used below: 

. I s !yKeAyk\uk)9e{uk\xk)duk 

geAyk\xk) = 7 — „ , I — ^1 — , — 1— • (6) 

Jy2 Kg^^{yk\uk)ge(Uk\xk)aukdyk 

Throughout the article we critically choose Kg ,:{yk\uf^) such that the denominator of (|6]) does not depend upon 
Xk or 6. As noted in [5], after integrating out the this representation leads to a new (or perturbed) HMM 

with transitions fg and likelihoods ge^e- Parameter estimation associated to the smoother Tr^^e just considers the 
function: 



^Og{p6,e{yi--n)) ^^^Og{Pe,e{yk\yi:k-l)) 



fc=l 



where 



PeAyk\yi:k-i) = / geAyk\xk)'^eAxk\yi:k~i)dxk- 



X 



We term the maximizer of pe,e{yi:n) as the ABC-MLE. One can then define a RML procedure for the ABC-HMM 
as in Section [2m 

^'fe+i =0k + ak+iV log{pgAyk\yl■.k-l)}■ 
^a practice, one can consider an estimation oi peAyi-.n) including factors independent of 0,e; this is discussed in 
Section [31 

Results on associated to the asymptotics of the ABC-MLE (i.e. as n grows) can be found in [5112]; there is an 
asymptotic bias. In addition, in the case of noisy ABC, where the data become corrupted, there is no asymptotic 
bias and one can recover the true parameter. We remark that the methodology that is considered in this article 
can easily incorporate noisy ABC. However, there may be several reasons why one may not want to use noisy ABC: 
(1) the consistency results (currently) depend upon the data originating from the original HMM; (2) the current 
simulation-based methodology may not be able to push e towards zero. For (1), if the data do not originate from 
the HMM of interest, it has not been studied what happens with regards to the asymptotics of noisy ABC for 
HMMs. It may be that some investigators might be uncomfortable with assuming that the data originate from the 
exactly the HMM being fitted. For (2) the asymptotic bias (which is under assumptions cither 0{e) or O(e^) |8|[9]) 
could be less than the asymptotic variance (under assumptions 0{e^) 111 [9]) as e could be much bigger than 1 when 
using current simulation methodology. We do not use noisy ABC in this article, but acknowledge its fundamental 
importance with regards to parameter estimation associated to ABC for HMMs; our approach is pragmatic, taking 
into account points (l)-(2). 



2.3 Result 

We now prove an upper-bound on the bias induced by the ABC approximation on the log-likelihood and gradient 
of the log- likelihood. The latter is more relevant for parameter estimation, but the mathematical arguments are 
considerably more involved for this quantity, in comparison to the ABC bias of the log-likelihood. Hence the log- 
likelihood is considered as a simple preliminary result. These results are to be taken in the context of ABC (not 
noisy ABC) and help to provide some guarantees associated to the numerics. 
We consider the scenario 

KeAyk\uk) = lA,,„^(Mfe) 
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where the set A^^y^^ is specified below. Throughout | • | is understood to be an Li— norm. The hidden-state is assumed 
to he on a compact set, i.e. X is compact. We use the notation 7'(X) to denote the class of probability measures 
on X and A^(X) the collection of finite and signed measures on X. || • || denotes the total variation distance. The 
initial distribution of the hidden Markov chain is written as /ig G "/^(X). In addition, we condition on the observed 
data and do not mention them in any mathematical statement of results (due to the assumptions below). We do 
not consider the instance of whether the data originate, or not, from a HMM. For the control of the bias of the 



gradient of the log-likelihood (Theorem 2.1 1, we assume that dg = 1. This is not restrictive as one can use the 
arguments to prove analgous results when > 1, by considering componentwise arguments for the gradient. In 
addition, for the gradient result, the derivative of pig is written Jig G A4[X). We make the following assumptions, 
which are extremely strong. They are made to keep the proofs as short as possible. 

(Al) Lipschitz Continuity of the Likelihood. There exist L < +oo such that for any x G X, y, y' e Y, G 8 

\9e{y\x) - g0{y'\x)\ < L\y-y'\. 



(A2) Statistic and Metric. The set A^^y is: 



Ae,y = {u -.{y -u\ < e}. 



(A3) Boundedness of Likelihood and Transition. There exist < C < C < -|-cxi such that for all G X, y G Y, 

6* G e 

C<fe{x'\x)<C, 
C<gg{y\x)<C. 

(A4) Lipschitz Continuity of the Cradient of the Likelihood. fe{x'\x), geiylx') are differentiable in 9 for each 
a;, G X, y G Y. In addition, there exist L < +oo such that for any x G X, y, y' G Y, 6* G 9 

\V{ge{y\x)}~W{ge{y'\x)}\<L\y-y'\. 

(A5) Boundedness of Cradients of the Likelihood and Transition. There exist < C < C < +oo such that for all 
x, x' G X, y G Y, 6* G e 

C<W{fe{x'\x)}<C, 
C<W{ge{y\x)}<C. 

We first have the result on the ABC bias of the log-likelihood. The proof is in appendix [B] 

Proposition 2.1. Assume (Al-3). Then there exist a C < +oo such that for any n > 1, fig G ^(X), e > 0, 6* G 8 
we have: 

|log(pe(yi:„)) - log(pe,£(yi:n))| < Cne. 

Remark 2.1. The above proposition gives some simple guarantees on the bias of the ABC log-likelihood. When 
using SMC algorithms to approximate log(pe(yi:„)) , the overall error will be decomposed into the deterministic bias 



that is present from the ABC approximation (that in Proposition 2.1) and the numerical error of approximating 
the log-likelihood. Under some assumptions, the L2 — error of the SMC estimate of the log-likelihood should not 
deteriorate any faster than linearly in time; this is due to the results cited previously. Thus, as the time parameter 
increases, the ABC bias of the log-likelihood will not necessarily dominate the simulation-based error that would be 
present even if gg is evaluated. 



Proposition |2.1| is reasonably straight-forward to prove, but, is of less interest in the context of parameter 
estimation, as one is interested in the gradient of the log-likelihood. We now have the result on the ABC bias of 
the gradient of the log-likelihood. The proof in appendix [C| 

Theorem 2.1. Assume (Al-5). Then there exist a C < +00 such that for any n > 1, ^g £ V{X), Jig G Ai{X), 
e > 0, 6 G Q we have: 

|V{log(pe(yi:„))} ~ V{log(pe..(2/i:„))}| < Cne(2 + H^I^Ii). 
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Remark 2.2. The above Theorem again provides some explicit guarantees when using an ABC approximation along 
with SMC-based numerical methods. For example, if one can consider approximating gradients in an ABC context 
(see iSlf ). then from the results of [T^ , one expects that the variance of the SMC estimates to increase only linearly 
in time. Again, as time increases the ABC bias does not necessarily dominate the variance that would be present 
even if gg is evaluated (i.e. one uses SMC on the true model). 



Remark 2.3. The result in Theorem 2.1 can be found in eq. (72) of and direct limit (as e \i 0) in J^. However, 
we adopt a new (and fundamentally different) proof technique, with a substantially clearer proof and an additional 
result of independent interest is proved. We derive the stability w.r.t. time of the bias of the ABC approximation of 
the filter derivative; see Theorem \D.l\ in appendix\^ 

3 Computational Strategy 
3.1 SMC 

In order to perform online parameter estimation, we will need to use a SMC algorithm to approximate pe,e(yfc|yi:fe-i) 
for 6 fixed; this is a critical quantity that we will use below. An algorithm which can do this is the SMC approach 
in [18 which is detailed in Figure [Tj with proposals {qk.e}i<k<n with density w.r.t. Lebesgue measure. 

On the basis of Figure [Tj one can approximate pe,t{yi:n)^ up-to a constant that is independent of 9, as follows. 
In an abuse of notation, we denote this SMC estimate (which does not include factors that do not depend on 9) as 
Pe^eiVi-.n)- The SMC estimate is 

N 



N ^ 

k=\ i=\ 

with 

vlMyv.k^^) = \Yy^t^- 
1=1 

These estimates are unbiased for any > 1 (see TIP). In practice, we are interested in the log-likelihoods; taking log- 
arithms of the above estimates generally leads to a biased approximation of log{pe^c(2/i:„)} and \o^\pQ^J(yk\y\±-\)\. 
One can implement a form of bias correction, using the Taylor series expansion ideas in I25i . Throughout, we use 
the bias-corrected estimates: 

l0g{Pe,e(yi:«)} = l0g{p^,(2/l:«)}+ ^P^e(2/l:n)"^ 

log{p^,(yfc|2/i:fc-i)} = log{p^,(7;fe|yi:fc_i)} + ^p^,(yfc|2/i:fc„i)-2. (7) 

The parameter e can be computed adaptively; see |18j . It is remarked that a drawback of this algorithm is that 
when dy grows with e, iV fixed, one cannot expect the algorithm to work well for every e; typically one must increase 
e to yield reasonable algorithmic results and this is at the cost of increasing the bias. To maintain e at a reasonable 
level, one must consider more advanced strategies which are not investigated here. 

One final point, which is often useful in practice. One can modify the ABC approximation to: 



7re,e(w}^„,...,uj^„,a;o:„|2/i:„) oc /^^(xo) J]^ (]g X! -^«>^(2/fe|"i-)) IIseKl^^fe) feWW-i) 

fc=i L j=i j=i 



which yields the same bias as the original ABC approximation (on integrating the u variables) but can yield 
substantial computational improvements. This is because as M grows one approximates a marginal SMC that does 
not sample the auxiliary u variables. 

Remark 3.1. We note that, suppressing 9, if the HMM can be written in the form: 

Yn = UXn,Wn) n>l 
Xn = (fn{Xn~l,Vn) n>l 

where Xq ^ Xq Cz X is known, G Y, y„ G X with {Ki}n>i i-i.d. Wn G Y with {VF„}„>i i.i.d. and independent of 
{yn}n>i 0-nd ^„ : X X Y Y, (p„ : X X X — > X. Suppose that: 
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Step 0. For i = 1, . . . ,iV sample Xq i.i.d. from ^g(xQ)dxQ. Set Wg = ^^^^ i G {Ij • ■ • i-^}- Set 

A: = 0. 



• Step 1. Resample particles from 

TV 



which are also denoted {x^ }, and set = j^. Set fc = + 1 and if fc = n + 1, stop. 
For i = l,...,iV, sample X^, fron 

Compute 



• Step 2. For i = l,...,iV, sample xj^!^ from (?fc,e(a;fe|a:[,*lj^)(ia;fc and t/^*'' from the likelihood gg{uk\x^^^)duk. 



Qk.eixkWkLi) 

renormalize the weights and return to Step 1. 

Figure 1: SMC Algorithm for ABC target. 

• One can evaluate the densities of Wn and V„ and sample from the associated distributions. 

• One can evaluate ^„ (resp. ipn) pointwise, for each n > I and X^, Wn (resp. Xn-i, Vn)- 

One can construct a 'collapsed' (see JEBI) -ABC approximation (assuming Kg^^{y\u) = I^^^(u), Ay^^ = {w G Y : 
d{u,y) < e}, with d a distance metric onY) 

n 

Tre{wi;n,Vi;n\yi:n) OC ^Ay^ ,S^k{f^''\xo, Vi;k) , Wk))p{wk)p{vk) ■ 

k=l 

Hence a version of the SMC algorithm in Figure^ can be derived which does not need to sample from the dynamics 
of the data. In additon one does not need access to the transition density of the hidden Markov chain. This 
representation, however, does not always apply. 

3.2 SPSA 



Recall the RML procedure in Section 2.1 where ge{y\x) is not intractable: 



Ok+i = dk + Ok+isJ \og{peo.^{yk\yi:k-i)) (8) 

for {a„} a sequence of step-sizes. In practice, one does not know the gradient and must resort to (e.g.) SMC 
techniques to approximate it; see for example [55]. In our ABC context one can run the algorithm in Figure [l] to 
approximate the ABC filter. To recursively update 9, at least using the ideas in [521, O'^^ to evaluate 

log(5e(y|a:)) and ^ \og{ge{y\x)) (9) 

which we will not have access to. 

We propose the following computational scheme; the idea is to use SPSA, which does not require the quantities 
in ([9]). Introduce a decreasing sequence of positive numbers {c^.}. Suppose, with {0^} as in the update, ([s]), we 
have 

Vfc,afe>0 afe,Cfc^-0 ^ = 00 ^ < 00. 

k k 

Start with some initial guess 6q and perform the standard SMC update (i.e. as in Figure [T]) for two sets of 
particles. One with parameter: 

9o + cqAq 

and the other with parameter: 

6*0 - coAq 
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where Aq is a d^— dimensional vector with each entry ±1 Bernoulh distributed (see |1S]). For both algorithms 

compute log(peo+coAo,e(yi)) and log(p6io-coAo,c(yi)) respectively, where the estimates are the bias-corrected versions 
as in equation Q. To obtain the next parameter estimate, in the dimension, take 

o+coAo,e (yi)) - iog(p (yi)) 

fl.i — fo.j + Oi T . 

At any subsequent time-point, with 9^ and perform the standard SMC update for two sets of particles. One 
with parameter: 

dk + Ck^k 

and the other with parameter: 

9k - CfeAfc 

For both algorithms compute \og{pe^+c^,AkAyk\yi:k-i)) and log(pefc-CfcAt,e(yfe|i/i:fc-i)) To obtain the next parameter 
estimate, in the i"*— dimension, take 

a a ^ log(p0,+c,A,,£(yfc|2/i:fc-i)) -log(pe,_c,A,,e(yfe|j/i:fe-i)) 

fk+l.i — fk,i + flfc+l T • 

This algorithm does not require one to evaluate ge or its gradient. We refer the reader to [25] and [37] for a 
theoretical justification of this procedure. 



4 Numerical Simulations 

We consider two numerical examples that are designed to investigate the accuracy and behaviour of our numerical 
algorithms. In order to do this, we do not consider scenarios where gg is intractable. 



4.1 Linear Gaussian Model 

We consider the following linear Gaussian HMM, with Y = X = M: 

Xn — 'f'Xn-1 + <7vVn, 

with W„, Vn independent and Wn A/'(0, 1), Vn A/'(0, 1). In the subsequent examples, we will use a simulated 
dataset obtained with 9 = {av,4>,a-w) — (0.2,0.9,0.3). 



4.1.1 Offline MLE 

We begin by considering a small data set, of n = 1000 data points. The offline scenario is the one for which we can 
expect the best possible performance of the ABC-SMC; if we cannot obtain reasonable parameter estimates in this 
scenario we would not expect ABC to be useful in practice. We are concerned with obtaining offline ABC-SMC 
estimates 

a r\ n r\ , logbe,+c. A, (2/i:iooo)) - log(p0,-c. A, (2/i:n)) 
0,+i(z) = e,{^) + a,+i ^-^^ , 

where j is the iteration, 9j{i) is the parameter estimate in the i*''-dimension, and Aj{i) is the i*''-entry of the 
Bernoulli distributed vector. For the SPSA stepsizes, we chose cj = j""'^, aj = 1 for j < 10000, and aj = 
{j - 10000)"°-^ for j > 10000. The iteration consists of running the ABC-SMC algorithm for 1000 data-points, 
with the current value of 9. 

In Figure [2] we compare offline estimates of the following cases: 

(a) Kalman Filter (KF) with SPSA 

(b) SMC on the true model using N = 1000, with SPSA 

(c) ABC-SMC using N = 200, Af = 10, e = 0.1, with SPSA 

(d) Maximum Likelihood estimates (MLE) from an offline grid search optimization. 

In this particular test case, we can observe good relative performance of the ABC-SMC procedure, with regards to 
estimating parameters. This strong performance allows us to investigate a slightly more challenging scenario. 
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-KF 

-SMC 

- ABC-SMC 

-MLE 



0.8 1 1.2 

Iteration 



Figure 2: A typical run of the offline parameter estimates obtained by the KF, SMC, and ABC-SMC for the Unear 
Gaussian HMM, along with ther parameters' offline MLEs. 



4.1.2 Online MLE 

We now consider a larger data set with n = 50, 000 data points, simulated with the previously indicated parameter 
values. We use the online SPSA method described in Section 3.2 The SMC (i.e. on the true model) and ABC-SMC 



algorithms were employed with the same N (and M, e for ABC-SMC) as in the offline case, and the SPSA sequences 
are similar to their offline forms, in Section [4.1.f[ 

We ran fifty independent runs of the each algorithm considered in the previous Section. In Figure [3| we plot 
the medians and credible intervals for the 25-75% and 5-95% percentiles of the parameter estimates (across the 
independent runs). The Ok converge after k = 20000 time steps, with the KF and SMC yielding similarly valued 
estimates. We observe increased variance from left to right in Figure [3j which we attribute to the randomness 
of SMC and ABC-SMC respectively. In particular, the expected reduced accuracy of ABC-SMC against SMC is 
apparent, but, the bias does not appear to be substantial (for ABC-SMC) in this particular example. 




(a) Kalman 



(b) Sequential Monte Carlo 



(c) SMC-ABC 



Figure 3: Credible intervals for the 5-95% and 25-75% percentiles, and the medians for multiple runs of online 
parameter estimates streamed by the KF, SMC, and ABC-SMC for the linear Gaussian HMM. 
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4.2 Lorenz '63 Model 



4.2.1 Model and Data 

We now consider the following non-linear state-space model with X = Y = M'^. The original model is such that 
hidden process evolves deterministically according to the Lorenz '63 system of ordinary differential equations, 



dt 

dXt{3) 
dt 



Xt{l)Xt{2)~l3Xt{3). 



where we recall that the arguments Xt{j) are the j*'*— dimension at time t; where t is continuous here. We modify 
the model to one such that the hidden process is a discrete-time Markov chain with stochastic dynamics: 

X„ = U{X^-i) + Vn, n>l 

where /„ is the 4*''-order approximation Runge Kutta solution to the Lorenz '63 system, Vn A/'(0, rld^) and Xq 
is taken as known. Here r is used to represent the time-discretization. 
For the observations: 

K„ = HX,, + QWn, n > 1 

where Wn Af{0,ldy), W„ is independent of Vn and Q is the Cholesky root of a Toeplitz matrix defined by the 
parameters k and a as follows: 

Qtj = '^S {K^^mm{\i- j\,dy - |i - j|)) , ijG {1, ...,dy} 

S{z) = 



1- \z^, < z < 1 



0, z>\ 



and 



I 0, i^j 

When 6 — {K,a,aG3, p, 13) = (2.5 2, 10, 28, |), n = 5000 and r = 0.05, a visua lisation of the Lorenz '63 (hidden) 
dynamics is shown in Figure 4(a) and the associated simulated dataset in|4(b)[ 




(a) Hidden Markov xi:5ooo (b) Observed data ?/i;5ooo 

Figure 4: Evolution of the 3-dimensional Lorenz '63 HMM in Section [472] 

For the simulated dataset in Figure [4(b) [ we use ABC-SMC to obtain online parameter estimates for 9 and we 
study the performance of these estimates under different settings. We will use 0^^^ to denote the estimate of 9 at 
time n, that was estimated using N particles, M pseudo-observations and a Gaussian kernel with covariance ^Idy- 
We will compare the behaviour of the algorithm as each of N, M, n, e varies. 
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4.2.2 Numerical Results 



We now examine the performance of the algorithm with N E {100, 1000, 10000}. For each value of N, we ran fifty 
independent runs of ABC-SMC, using M = 10 and e = 1. In Figures [5(a)||5(d)| we plot boxplots of the terminal 



parameter estimates, 6*1^50001 against their true values marked by dotted green lines. In Figures 5(e)|5(h) we plot 



the absolute value of the Monte Carlo (MC) bias (that is, the absolute difference between the estimate and true 
value), in red, and the MC standard deviation, in blue. The MC bias and standard deviation points are fitted 
with least-squares curves proportional to the standard MC rates with which the accuracy of the estimates is 
expected to improve. With regards to the variability of the estimates one sees the expected reduction in variability 
as N increases. The bias is harder to quantify; it will not necessarily be the case that as iV grows the bias falls. This 
is because there is a Monte Carlo bias (from the SMC), an optimization bias (from the SPSA), an approximation 
bias (from the ABC) and the fact that the data have been generated from the model (so the true static parameters 
might not be exact). Increasing N can only deal with the SMC bias (which for estimates with parameters fixed is 
0{N~^)), but the addition of parameter estimation again does not make it easy to understand what happens here. 
The main point is simply as expected; one obtains significantly more reproducible/consistent results as N grows. 




, s -JV.IO 



W <^1,5000 



/ N -N,10 
(C) (T„„' 



W Pi, 5000 




• bias 

• stde\ 



• bii 

• stde\ 



/ ^ -JV,10 

y'^) '^1,5000 



/r\ ~N,10 

y'^) ""i.sooc 



/ N -]V,10 



ru\ -iV.lO 
W Pi, 5000 



Fi gure 5: ^1.5000 when estimating 9 — (k, c, 1753, p) of the Lorenz '63 HMM, using ABC-SMC with values of 

in boxplots and their true values in dotted green lines. 



N G {100, 1000, 10000}. Figures 5(a)|5(d) 



show the 9 



'N.IO 
1,5000 



Figures 5(e)|5(h) show the MC bias and MC standard deviation of the 6*1^5000; ™ I'ed and blue, with curves of least 



squared-error cx 



Next w e look at the influence of the pseudo-observations. For M e {1,3,5,10,25,50}, we show in Figures 
6(a)|6(d) the boxplots of the terminal estimates ^"5^00^ from fifty independent runs of ABC-SMC, using N = 5000 
. The dotted green lines marks the true 9 values which generate the data. In Figures 6(e)|6(h)"| the MC 



and e — 
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biases and the MC standard deviations of the 9i ^qqq are plotted point-wise, in red and blue, with lines of least 
squared-error fit to them. As M increases, we see reductions in the MC variance. This reduction in variance can be 
attributed to the fact that the ABC-SMC algorithm approximates an algorithm that does not simulate the pseudo 
data; hence by a Rao-Blackwellization argument, one expects a reduction in variance. These results are consistent 
with [12]. For this example, after M > 5, there seems to be little impact on the accuracy of the estimates; it is not 
clear whether such performance occurs for other examples. 




Figure 6: ^™ooo^ when estimating 9 — {K,a,aQ3, p) of the Lorenz '63 HMM, using ABC-SMC with values of 

show the ^^o'oo^ in boxplots and their true values in dotted green lines. 



M G {1 3,5 10,2 5, 50}. Figures 6(aj|[6(d) 



Figures 6(e)|6(h) show the MC bias and MC standard deviation of the ^^sodo^i in red and blue, with lines of least 
squared-error. 



We now vary n; for n € {5000, 10, 000, 15, 000}. 



M = 10, and e = 1, and plotted boxplots of the terminal estimates 9^ ^' , in Figures 7(a) 



We ran fifty independent rims of ABC-SMC using iV = 200, 

against the true 



7(d) 



values of 9 marked in dotted green lines. Recall that recursive maximum likelihood estimation tries to maximise 
^ \og{pe,e{yi:n)), SO wc cxpcct n not to have a great effect on the bias nor the variance (also due to the bias results 
in Section 2.3 and the subsequent consistency results in j8l[9]). This is confirmed in Figures 7(e)|7(h) where the 
absolute value of the MC biases and the MC standard deviations have been plotted in red and blue, and fitted with 
linear lines of least squared-error. 

Finally, we investigate the influence of e e {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50}. For each e, we again ran fifty independent 
runs of ABC-SMC with N = 200 and M = 10, for the dataset n = 5000. The boxplot of the parameter estimates 
are plotted, in Figures 8(a)||8(d) against dotted green lines which indicate the true 9. Figures [8(e)|8(h)1 show the 
absolute value of MC biases in red, and the MC standard deviations in blue. Fitt ed t o the MC biases is a non-linear 
least squares curve proportional to e -I- i. The result we presented in Section 2.3 states that as e increases, the 



bias will increase on 0{e), hence the term proportional to e of the fitted curve. However, the ABC-SMC algorithm 
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Figure 7: 6^ „' when using ABC-SMC to estimate 9 — (k, a, o-gSi p) of the Lorenz '63 HMM, for datasets of length 

n £ {5000, 10000, 15000}. Figures 7(a)|7(d) show the 9^ ^^' in boxplots and their true values in dotted green lines. 

7(h) show the MC bias and MC standard deviation of the 6*]^ , in red and blue, with lines of least 
squared-error. 



becomes less stable for e too small (in the sense that, for example, the variance of the weights will become larger 
as e grows), incurring more varied estimates and affected biases; thus the term proportional to -. Fitted to the 
MC standard deviations is a non- linear least squares curves proportional to ^. For this example, the MC standard 
deviation decreases at this rate as e increases. 



5 Summary 

In this article we have presented a technique to perform online parameter estimation using ABC-SMC and SPSA 
for HMMs. This is useful for models where the state-dimension is high and the parameter and observations are of 
moderate dimension. In addition, it is required when the conditional density of the observations given the hidden 
state is intractable. 

Some future work is as follows. The representation in Remark |3.1| can be potentially useful for alternative 
online parameter estimation techniques, other than using SPSA. In |13j we are investigating the use of the online 
EM algorithm [3T] and any potential benefit that it may have over the ideas in this paper. We have remarked 
that one drawback of the SMC algorithm implemented is its inability to deal with small e. Two potential ways to 
proceed are as follows. One is to introduce a further approximation by the expectation-propagation algorithm (as 
in [2]) and potentially removing SMC altogether. The other is to consider more advanced SMC approaches such 
as [TT] and how this might help one reduce e; this is an area of ongoing research. We are also considering ABC 
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Figure 8 

e G {1, 2,3,..., 10, 50}. Figures 8(a)||8(d) show the MC biases and their curves of non-hncar least squared-error 
proportional to e + ^ in red, and the MC standard deviations with their curves of non-linear least squared-error 
proportional to - in blue 



approximations in the scenario of deterministic dynamics for the hidden state; these models have wide application 
in applied mathematics as filtering initial conditions of partial differential equations. 
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A Notations 

We introduce a round of notations. As our analysis will rely upon that in |29) our notations will follow that article. 
It is remarked that under our assumptions, one can establish the same assumptions as in [23 . Moreover, the time- 
inhomogenous upper-bounds in that paper can be made time-homogenous (albeit less tight) under our assumptions. 
In addition, our proof strategy follows ideas in |T]. 

Sb(X) is the class of bounded and real- valued measurable functions on X. Throughout, for if E Sb(X), ||iy9||oo ■= 
supjjgx For ip e Bb{X) and any operator Q : X ^ A4(X), Q{(p){x) J^ip{y)Q{x,dy). In addition for 

He e A^(X), HeQi^p) := J.^fieidx)Q{(p){x). 

We introduce the non-negative operator: 

Re,n{x,dx') := g0{yn\x')j0{x'\x)dx' 
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with the ABC equivalent Rg^t.nix^dx') :— ge^c{ynW)fe{x'\x)dx', gg^^{y\x) — g(u\x)dy/ dy. To keep 
consistency with [55] and to allow the reader to follow the proofs, we note that the filter at time n > 0, Fg{iig) 
(resp. ABC filter, at time n, F^^{iig)) is exactly, with initial distribution S 'P(X) and test function cp S Bi,{X) 

resp. 

^g,eW)[V) - — ^ TTV 

where F^{fj.g) = F^^tJ^e) = Me, ^i,n,e(</')(a;o) = j ]Xk=i^kfi[xk-i,dxu)'p{xn)- In addition, we write the filter 
derivatives as Fg{fig,'fig)(ip), Fg^{fig,'jlg){ip) where the second argument is the gradient of the initial measure. 
The following operators will be used below, for n > 1: 

G''{fig,jrg){ip) := {^igRl,n.g{l)r^[JrgRl,nA^)~J^eRl,nMWFS'Mi^)] (10) 
HAfig)iip) F^-\^ig)R,,A^)-'[Fr'MRnAv>)-Fr\^^e)R^^^^ (11) 

with the convention G^{fj,g,'flg){ip) = Jig. In addition, we set 

G^''\^ig,Jrg){ip) itigRnAAr^U^eRnAv)-'i^eRnA^)F^''\t^e)M- 

where Fg"'\^g) — ^igRnfi / ^^0Rn.,e{^)■ Finally, an important notational convention is as follows. Throughout we use 
C to denote a constant whose value may change from line-to-line in the calculations. This constant will typically 
not depend upon important parameters such as e and n and any important dependencies will be highlighted. 

B Bias of the Log-Likelihood 

Proof of Proposition \2.1\ We begin with the equality 



^Og{pg{yi:n)) - \0g{pgAyi:n)) = ^ [ log(pe (l/fe | yi:fc-l ) ) " log(pe,e (^/fc | yi:fe- 1 )) ) (12) 

fc=l ^ ^ 

with, for 1 < k < n 

Pe{yk\yi:k-i) = / gg{yk\xk)f0{xk\xk^i)Fg^^{iig){dxk^i)dxk 

PeAyk\yi:k~l) = / gg{yk\xk)f0{xk\xk-l)Fg~'^{ng){dXk~-l)dXk- 



We will consider each summand in (12). The case k > 2 is only considered; the scenario k — 1 will follow a similar 
and simpler argument. 

Using the inequality | log(x) — log{y)\ < \x — y\/ {x A y) for every x,y > we have 

n / / I ^^ w / I ^ \Pe{yk\yi:k-i) - PeAVklyi-.k-A 

log U/fe yi:fe-l - log Pe,£(yfc 2/l:/c-l < . 7 . ^■ 

pe[yk\yi:k-i) Apg^e{yk\yi:k~i) 

Note that 

pe{yk\yi:k~i) ^pe{yk\yi:k~i) = 
/ gg{yk\xk)fg{xk\xk^i)Fg~^{fig){dxk~i)dxk A gg{yk\xk)fe{xk\xk^i)FgAif^o){dxk-i)dxk>C>0 (13) 
where we have applied and C does not depend upon e. Thus we consider 

\peAyk\yi:k-i) - pe{yk\yi:k-i)\ = 

1/ gg{yk\xk)fe{xk\xk-i)Fg^^{ng){dxk^i)dxk - / gg.e{yk\xk)fe{xk\xk-i)Fg~^{^ig){dxk^i)dxk\. 

JX2 JX2 
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The R.H.S. can be upper-bounded by the sum of 

[goivklxk) - ge,e{yk\xk)]feixk\xk~i)Fg~'^{^ie){dxk~i)dxk\ 



X2 



and 



I / ge,e{yk\xk)fe{xk\xk-i)[Fg ^^{n9){dxk-i) - Fg ^'^{fig){dxk-i])dxk\. 
The first expression can be dealt with by using (j'^, which imphes 



sup \ge,eiyk\x) - geAyk\x)\ < Ce. (14) 

The second expression can be controlled by pSl Theorem 2]: 

svip\\F^-\tie)-F^;\^ie)\\<Ce (15) 

k>l 

to yield that 

\pe.Ayk\yi:k-i) - Pe{yk\yi:k^i)\ < Ce. (16) 
One can thus conclude. □ 

C Bias of the Gradient of the Log-Likehhood 



Proof of Theorem 2. 1 We have that 



V^logp9(j/i:„) - logpe,<:(yi:n)^ = | ^ {^og\pe{yk\yi:k-i) - \og[peAyk\yi:k-i, 
ft then follows that 

V^logpe(yi:n) - logPe,e(yi:„) 

AVpe(yfe|2/l:fc-l) - Vpe,e(2/fc|yi:fe-l)] , yVeAyk\yi:k-l) \ t \ \ t \ m\ 

V r~\ ^ + ~^~\ ^ — rn ^[peAykWi-.k-i) ~ Pe{yk\yi:k~i)\ ■ 17) 

^ \ pe(.yk\yi:k-i) Pe[yk\yi:k-i)peAVk\yi:k-i) ' 



k=l 



We will deal with the two terms on the R.H.S. of (171 in turn. The scenario A: > 2 is only considered; the case 
k = 1 follows a similar and simpler argument. 
First starting with summand 

[Vpe(yfe|yi:fe-i) - Vpe,e(yfe|yi:fe_i)] 



Pe{yk\yi:k-i) 



Noting (13), we need only upper-bound the Li norm of the following expression 



'^{9eiyk\xk)}fe{xk\xk-i)Fg {p0){dxk-i)dxk ~ '^{geAyk\xk)}fe{xk\xk^i)Fg\ {iJLe){dxk^i)dxk (18) 
+ / ge{yk\xk)'^{fe{xk\xk-i)}Fg~'^{^.e){dxk-i)dxk- / 9e,e{Vk\xk)'^ {fe{xk\xk-i)}Fg~'^ {^.e){dxk-i)dxk (19) 

JX2 JX2 

+ / ge{yk\xk)f0{xk\xk-i)Fg^^{}Xe,V^e){dxk^i)dxk~ / g0Ayk\xk)fe{xk\xk-i)Fg^'^{^ie,Vi){dxk^i)dxk. (20) 

JX2 JX2 

We start with (18 1. Using (A[4| we can establish that for each fc > 1 

sup \V{ge{yk\xk)} - '^{9eAyk\xk)}\ < Ce (21) 

where C does not depend upon k, e. Hence 

N{9e{yk\xk)} - V{g0Ayk\xk)}]feixk\xk-i)Fg''^{pg){dxk-i)dxk\ < Ce. 
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Then we note that by [H Theorem 2] (see ^) and {J^ 

'^{geAyk\xk)}feixk\xk-i)[Fg^^{fie){dxk-i) - Fg^^ ipg){dxk-i)]dxk\ < Ce 



X2 



Thus we have shown that 



y{geiyk\xk)}fe{xk\xk~i)Fg {p.0){dxk^i)dxk ~ i ^ {g0Ayk\xk)} fe{xk\xk-i)Fg ^ {pLg){dxk-i)dxk\ < Cn. 

JX2 



Now, moving onto (19), by (14) we have 



X2 



[99{yk\xk) - 9e,e{yk\xk)]'^ {fe{xk\xk~i)}Fg {^ig){dxk-i)dxk\ < Ce 



and can again use JHl Theorem 2] (i.e. (15)) to deduce that 



X2 



{yk\xkW{fe{xk\xk~i)}[F; '{ijg){dxk-i) - F^~'{ng){dxk^i)]dxk\ < Ce 



and thus that 



X2 



geiyk\xkW{f6ixk\xk-i)}Fg {^ie){dxk~i)dxk- I g0Ayk\xk)^ {f0{xk\xk-i)}Fg^^ {^i0){dxk-i)dxk\ < Ce 



X2 



which upper-bounds the expression in (19). We now move onto (20), which upper-bounded by 



[ge{yk\xk) - ge.e{yk\xk)]f0ixk\xk-i)Fg ^{^ie, ^J■e)idxk-l)dxk\ + 



X2 



geAyk\xk)fe{xk\xk-i)[Fg ^{i^e, iie)idxk-i) - Fg ^ \i2e, iJ.e)idxk-i)]dxk 



k-l, 



X2 



For the first expression, we can write: 



{snp \ge{yk\x) - ge,e{yk\x) 



[ge{yk\xk) - geAyk\xk)] 
xKJx (sup^ex \geiyk\x) - ge,eiyk\x)\)' 



fe{xk\xk-i)dxk ]Fg ^{iig,^ig){dxk-i)\ 



Then we can apply (14) and, noting that 

[9e{yk\xk) - geAyk\xk)] 



-Je{xk\xk-i)dxk < 1 



/x (sup^ex \ge{yk\x) - geAyk\x)\)' 
one can also use Lemma FD. 31 to deduce that 

[ge{yk\xk) - ge,t{yk\xk)]f0{xk\xk-i)Fg^^{ng,JIi)){dxk-i)dxk\ < C(l + ||Aie||)e. 



X2 



Then, one can easily apply Theorem |D.1| to show that 



X2 



ge.e{yk\xk)feixk\xk~i)[Fg {^J■e, fJ'0)idxk-i) - Fg^^ {^J.g, ^g){dxk-i)]dxk\ < C{2 + \\fig\\) 



Thus we have upper-bounded the Li— norm of the sum of the expressions (18)-(20l and we have established that 

[^pe{yk\yi:k-i) - ^peAyk\yi:k-i)] 



pe{yk\yi:k-i) 



<C(2+|1^||K 



(22) 



Moving onto the second summand on the R.H.S. of (17) 

^Pe,e{yk\yi:k-l) 



pe{yk\yi:k-i)pe.Ayk\yi:k-i) 



[peAyk\yi:k-i) - P0{yk\yl■.k-l)■ 
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By (16), we need only consider upper-bounding, in Li, Vpe.e(2/fc|yi:fc-i)- This can be decomposed into the sum of 
three expressions: 

'^{geAyk\xk)}fe{xk\xk~i)Fg~^{ng){dxk-i)dxk 



X2 



X2 



9e,eiyk\xk)'^{fe{xk\xk-i)}Fg ^ {fig){dxk-i)dxk 



and 



ge,eiyk\xk)f9{xk\xk-i)Fg^ ^{y.e, lie)[dxk-i)dxk 



X2 



As "S/ {g0_e{yk\xk)} and ge^e{yk\xk)'^ {f9{xk\xk-i)} are upper-bounded as well as X being compact the first two 
expressions are upper-bounded in Li. In addition as jy^ge,c{yk\xk)f0{xk\xk-i)dxk is upper-bounded, we can apply 
Lemma |D.3| to see that the third expression is upper-bounded in Li. Hence, we have shown that 



^PeAVk\Vl:k-l) 



pe{yk\yi:k-i)p9Ayk\yi:k~i) 



[peAyk\yi:k-i) ~ pe{yk\yi:k-i)] 



<C(l + ||M^||)e. 



Combining the results (|22|-(23) and noting (17) we can conclude. 



(23) 

□ 



D Bias of the Gradient of the Filter 



Theorem D.l. Assume (A 1-5). Then there exist a C < +oo such that for any n > I, Cz 'P(X), fig e A4(X), 
e > 0. 6* e 6; 

\\F;\fie,frg)-F^Af'o,ire)\\<Cei2 + \\jrg\\). 
Proof. We have the following telescoping sum decomposition (e.g. |10^) for the differences in the filters, with ip S 



f,"(m,)(^)-f,",(H(^)-E 



n— J3+2,n/ T-in— p+1 , 



where we are using the notation Fg'^\fj,g){ip) — ^^^^'''"'^(ij i for 1 < q < n. Hence, taking gradients and swapping 

the order of summation and differentiation we have and omitting the second arguments of F on the R.H.S. (to 
reduce the notational burden) 



p=i 



(24) 



Fr'^''AFt'^'\Frj{t^e%F^--^+'\F-;%figm^) 
To continue with the proof we will adopt |29i Lemma 6.4]: 

n 

F^{fig,jrg){v) = G^^{fig,jrg) + Y,&g'^''AF^M,H'^{f^e)^^^^ 

9=1 

with Gg and H'^{fig) defined in (lOl-(ll) and Gg^^'" similar extension to the notation as for the filter Fg^^'"' and 
the convention Gg^^'" {fig, Jig) = Jig. Returning to (24) and again omitting the second arguments of F on the 
R.H.S. : 

Fe{fl9,jrg){ip) - F^Jfig,jrg)iip) = 



E 



E {GV''''{Fr'^''nFl,''-'^'\F-;^if,g))],H--^^^^^ 
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We start first with the summand on the R.H.S. of the second line of ([25]), which we compactly denote as: 
This can be decomposed further into the sum of 

and _ _ _ _ 



(25) 



(26) 
(27) 



Beginning with (26), by [551 Lemma 6.7], equation (43) we have 

\&r'{Fe[Fr/{l^o)]MFZJit^o)]W)~Gr\FeAF^^^^^ 

<CM\ooff-'\\Fe[F;;/{^,e)]-FeAFZnMm^^^ 
where p £ (0, 1) and C do not depend upon fie,£ or n,p. Applying Lemma D.2| we have 

\Gr'{Fe[Fl7M],Fo[F--^if,emv)-&o''{^^^^^ 

<CMooP''-'e\\Fe[F;::^{^^e)]\\ 

" F,[F"7^(M«)]||<C(2+||M^||) 



where C does not depend upon pg, e or n,p. Then by Remark 
and thus the upper-bound on the Li— norm of (26): 



D.l 



and Lemma 



D.3 



n~'{Fe[F;::^{pe)],Fe[F-:P{po)]W)-G^,''{FeAF^:'{pe)],Fe[^^^^^^^ (28) 



Now, moving onto (27), by [29, Lemma 6.7], equation (42): 

\Gl'{FeAFl7ipo)lFo[F;A''{f-iemv)~Gl''{FoAFe^^^ 

< CpP-^^\U\Fe[F^;^{ne)] - F,,.^^ (/.,)] ||. 

Applying Lemma [D.l | 

\&o~'{FoAFoA'MhFo[F'e-^{pgmv)-G^^^^ 

<Gy\UpP-\l + \\F^-Pipg)\\). 

Then by Lemma [P. 3 [ we deduce that 
\G^-'{FeAFl;■''{pe)],h[Fl7i^^e)]m~G^,-'{FoAF^A^ 

(29) 



Combining (28 1 and (29) 



\Gl-'{Fe[F-:'\po)lFo[F;:^{pem^)~Gl-'{FeAF^:^ipe)],FeAF^^^ 

(30) 

We now consider the summands over q in the second and third lines of ([25|. Again, adopting the compact 
notation above we can decompose the summands over q into the sum of 



G^HFAFoiF^:^{^iemm[Fo{Fl:^i^^e))]}{^)^G^HF^[FeAFe:^ (31) 



and 



GrHF^[FeAFZ'M)iH^e[Fe{F"7if-i0)m^)^GrHF^[FoAFe:'i^^^^^^ (32) 



where s = q — n+p — I. We start with (31); by i[29i Lemma 6.7] equation (43), we have 
\&rHFAFeiF--PM)hmFo{F-:Pifio)m^^ 
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<CM\^p--^F|[Fe{F;~J{^Jie))]-F|[FeAFZ'{^^e))]^^^^^^^ 
Then we will use the stability of the filter (e.g. [221 Theorem 3.1]) 



By Lemma D.2 \\FgiF^;P{fXg)) - Fe^,{F^;P{t^g))\\ < Ce and thus 



\Gr'{Fe[Fe{F-;^M)],H^[Fe{F-;P{fie))]m^G'r^ 

<CMUpP-'\mFo{F--Pif,em\- 

By [23 Lemma 6.8] we have \\Hg[Fg{Fg~^ {p,g))]\\ < C, where C does not depend upon Fg{Fg^^ {pg)) or e and 
hence 

\Gr'{F|[Fe{F^;^{^^e))],H|[Fg{Fly{^,sm^^ 

Now, turning to ( [32| and applying [29, Lemma 6.7] (42) we have 

\G'l-''{F^[FgAFZJ{pg))]MFg{FZ''{t^e)mv^ 

< C||^|Up"-''||i?|[F,(F,7^(/i,))] -i?|[F,,,(F,';7(^,))]||. (33) 
Then by [29, Lemma 6.8] we have 

imFgiF^-^M)] ~ Fll{Fg.,{FZ''{.iiB))\\\ < Cp^WFeiF^-nif^g) - Fg,,{F--/{pg))\\ 



and then on applying Lemma |D.2| we thus have that 

\\Hl{Fg{F--/){^,e)) - Hl[Fl7'+'){pg)\\ < Cep\ 
Returning to ( [33| , it follows by the above calculations that: 

\G^HF^[Fg{F-:^i^^e))],mFg{F-;Pi^^emi^^^ 
Thus we have proved that 

\G--^F^[Fg{Fl:^{f,emH^g[Fg{F;];Pifiom{v^^^^ 

(34) 



Then, returning to 125^ and noting pOL (34 1 we have the upper-bound 



\\F^{pg,Tre)~F^^,{t,g,jre)\\<Ce{2+\\jre\\)Y.^pP-'+ E P^"'] < Ce(2 + ||/I^||). 

p— 1 q—n—p 

□ 

D.l Technical Results for ABC Bias of the Filter-Derivative 

Lemma D.l. Assume (Al-5). Then there exist aC < +oo such that for any n> \, pg ^ 'PO^)^ 1m) € A^(X), e > 
6* e e.- 

\\Flr\pe,P9)-Fl,'^}{pg,pg)\\ < Ceil + \\pg\\)- 
Proof. By [29l Lemma 6.7] we have the decomposition, for ip e Bb{X): 

F^"\pg,P0){^) = &-\po,pg){^)-Hj,-\pg){^) 

where 

H'^"-\pg){ip) := pgRn.g[iy^[pgRnfi{'p) - pgRn,e{l)pe[v)- 
Thus to control the difference, we can consider the two differences G^g'\pg, 'pg){ip)—G^^l{pg,'pg){ip) and Hg^\pg){(p) — 
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Control of G^g'\pLo , ^o){ip) — G^g'l{^e, ^^e){^)■ We will use the Hahn- Jordan decomposition: ~ — fj,g . 
It is assumed that both Jlg'^ {l),'flg~ (!) > 0. The scenario with either JIg^{l) = or = is straightforward 

and omitted for brevity. We can write: 



where /ig (•) = fig'^ {■) / f-ig^ (1) and iig (•) — ng {■)/fJ,g (1). Thus we have 



Gg"'\^^e,fJ'0){y^) - G'g'^^{flg,flg){ip) 



in), 



in). 



in). 



By symmetry, we need only consider the terms including Jig^; one can treat those with Jig by using similar 
arguments. First dealing with term on the first line of the R.H.S. of (35). We have that 

+i?„,e(l) M^+i?„,e,.(l)" 



fJ-eRn,e{^) fJ-eRn,e.e{^) 



\- fig nn,g,e(^) 



Now by (i^, for any n 



flgRn,0,e{'^)l^sRn,e{^) 



sup \Rn,g{l){x) - Rn,e.,{^){x)\ < Ce 



in). 



(36) 



thus 



h Me ^ri,e,£Uj 



Now one can show that there exist a C < +oo such that for any x, i/ G X 

Rn,em^) > CRn,e{l){y) i?„,e,,(l)(a;) > Ci?„,e,e(l)(2/) 

Then it follows that 



(37) 



Hence we have shown that 

-+ 



Me^n,e(l) Me^n,e,e(l) 



[f(")(m^+)(^) - f^r(Me)(^)] < C||^|UeM^"(l). 



Second, the second line of the R.H.S. of ([35|. By Lemma [p^ for any fig e V{X), \\F^"'\fig) - F^"^^ {fig)\\ < Ce, 
with G independent of fig, and in addition using (37) we have 



Me^-Rn,e,e(l) rp(n)/~+x 



[Fr(M^")(</') - J^r(Me)(¥') - F^'l>(fIg-){^) + F,7(/x,)(^)] < GM^efTg^il) 
Me^n,e,e(,tj 

Thus we have shown: 

0;'\flg,jrg){ip) ~ G'^^}iflg,jrg){^)\\ < Ge[jrg+{1) +jrg-{^^^ 



An)^^+ 



in). 



(38) 
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Control of H'f'\ne){ip) - H^J'J {ng){ip). We have 



in), 



fJ-eRn^W 



We start with the first bracket on the R.H.S. of (39). We first note that 



Rn,e{f)ix) - Rn,eA^){x) = / fe{x'\xMx')[Vge{yn\x') - VgeAynW)]dx' < Cy\\ 



(39) 
(40) 



where we have applied (21). Then wc have 

HgRn,0{<f) HeRn.eA'f) fJ.eRn.0if) - f^6Rn,e.€{'P) , n / . fioRnfi A^) - ^^0RnA^) 

^ \- fJe-nn,e,e{f)- 



Me-Rn,e(l) fJ'eRn,eA^) 



A*e^n,6l,e(l)M0-Rn,e(l) 



By using (40) on the first term on the R.H.S. of the above equation and by using (36) in the numerator for the 
second, along with (37) in the denominator, we have 

fieRn,0{f) fieRnfiAf) 



< Ce[||(^|loo + \fieRn,eAf)\]- 



fJ-eRn,e{^) l^-eRnfiA^) 

Then as 

RnfiAf){^) = j f{x)[^geAyn\x')fe{x'\x) - gBAyn\x)V fB{x\x)]dx <C\\ip\\oo I dx'<C\\ip\\. 
where the compactness of X and (A[5]) have been used, we have the upper-bound 

f^eRn,e{'p) fJ'eRn^eAv) 



fJ-eRnfii^) fJeRn,e,e{^) 



<C|l^|Ue. 



(41) 



(42) 



Moving onto the second bracket on the R.H.S. of (39), this is equal to 

fJeRnfiA^) fJeRnfii^) 
jieRnfiA^) neRn,e{i) 

By using the inequality ( [42| , we have 

' figRnfiA^) fJ.eRn.e{i) 



Fi'iMif) + ^^^f^[F£)(/..)(^) - F^"\f.om] 



fJeRnfiA^) fJeRnfiiX) 



F^':\t^em < Ce|F(J(M.)(^)| < CM^e. 



Using Lemma D.2 and in addition using (37) in the denominator and (41) in the numerator we have 

^'^"■'lli [F^l (^) - F^-^ ifJeKf)] <CMU 
where C does not depend upon fig and e. Thus we have established that 

fieRn,eA'^)Fi"\fl0){v) flgR^Al)F^''\fig)i^) 



fJ'eRn,e,e{^) 



fJ-eRruoi^) 



< C||(y5||ooe- 



One can put together the results of ( 42 ) and ( 43 ) and establish that 

\H^,"\flg)iv)~Hj^''^{f,0){^)\<Cy\U. 



On combining the results ( 38 ) and ( 44 1 and noting ( 39 1 we conclude the proof. 
Lemma D.2. Assume (Al-3). Then there exist a C < +oo such that for any n > 1, fig Cz V{X), e> 0, £ Q: 

\\F^'^\f,g)~F^:'^{fig)\\<Ce. 



(43) 

(44) 

□ 
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Proof. For ip £ Bbif) 



(«)/ 



+ fJ-eRn.e.ei^) 



f^eRn.,e,e{^)f^eRn,e{^) 



Then by applying (36) on both terms on the R.H.S. we have the upper-bound 

l^eRnfiW 



One ean eonclude by using the inequahty (37) for i?„,6i(l)(-). 



□ 



Lemma D.3. Assume (Al-5). Then there exist a C < +00 such that for any n > 1, £ V{X), fig e A^(X), 
e > 0. 6* e 6; 

\\F^{^ie.lre)\\ V \\Fl^{^ie.f^e)\\ < C(l + 

Proof. We win consider only Fg{fig,'fie) as the ABC filter derivative will follow similar calculations, for any e > 
(with upper-bounds that are independent of e). By |29l Lemma 6.4] we have for e yBh(X) 

n 

p=i 

By [221 Lemma 6.6] we have the upper-bound 

n 

with p e (0, 1). Then by Lemma 6.8], it follows that 

n 

\\Kit^e,jre)\\ <c{p-\\Tre\\ + Y.p''~'') 



from which one concludes. 



□ 



Remark D.l. Using the proof above, one can also show that there exist a C < +00 such that for any n > 1, 
He e ■P(X), JTe e MiX), e > 0, 6I e 6 



Flr\tJie.jre)\\ V \\F^':>[f,e.7re)\\ < + 
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